ABSTRACT



A substantial amount of empirical research suggests that cognitive ability test scores are 
increasing by approximately three IQ points per decade. The effect, referred to as the Flynn effect, 
has been found to be more substantial on measures of fluid intelligence, a construct known to be 
substantially correlated with memory span. Miller (1956) suggested that the typical short-term 
memory capacity (STMC) of an adult is seven, plus or minus two objects. Cowan (2005) suggested 
that the typical working memory capacity (WMC) of an adult is four, plus or minus one object. 
However, the possibility that both STMC and WMC test scores may be increasing across time, in 
line with the Flynn effect, does not appear to have been tested comprehensively yet. Based on 
Digit Span Forward (DSF) and Digit Span Backward (DSB) adult test scores across 85 years of data 
(respective Ns of 7,077 and 6,841), the mean adult verbal STMC was estimated at 6.56 (±2.39), 
and the mean adult verbal WMC was estimated at 4.88 (±2.58). No increasing trend in the STMC 
or WMC test scores was observed from 1923 to 2008, suggesting that these two cognitive 
processes are unaffected by the Flynn effect. Consequently, if the Flynn effect is occurring, it would 
appear to be a phenomenon that is completely independent of STMC and WMC, which may be 
surprising, given the close correspondence between WMC and fluid intelligence. 

© 2014 Elsevier Inc. All rights reserved. 



1. Introduction 

One of the most sensational scientific observations in the
area of contemporary intelligence research is that intelligence 
test scores have increased since about 1930 (Flynn, 2012; Lynn, 
1982). The reported effect is not small, as it corresponds 
to approximately three IQ points per decade (Flynn, 2007; 
Neisser, 1998). Furthermore, the consequences are not negli- 
gible, as Flynn (1987) contended that the "...gains suggest that 
IQ tests do not measure intelligence but rather a weak causal 
link to intelligence" (p. 190). The precise nature and causes of 
the 'Flynn effect' remain enigmatic (Williams, 2013). Furthermore,
a number of limitations associated with studies supportive of the
Flynn effect have been articulated, including invalid
test score comparisons due to changes in test items and
administration across editions (Kaufman, 2010), changes in the 
rate of human cognitive development in both the young and 
the elderly (Parker, 1986), changes in standard deviations 
(Rodgers, 1998), as well as the absence of factorial invariance 
associated with intelligence battery test scores (Must, te 
Nijenhuis, Must, & van Vianen, 2009; Wicherts et al., 2004). 
Consequently, the purpose of this investigation was to examine 
the Flynn effect on several normative samples at the observed 
score level on possibly the only subtest of intellectual function- 
ing that has essentially not changed for over a century: Digit 
Span. As Digit Span incorporates both forward and backward
recall items, an additional purpose of this investigation was
to estimate precisely the typical verbal short-term memory
capacity (STMC) and working memory capacity (WMC) of 
adults, so as to verify the proposed values reported by Miller 
(1956; 7 ±2) and Cowan (2005; 4 ±1). 

1.1. Overview of the Flynn Effect

The accumulated research suggests that the Flynn effect is 
more pronounced on fluid intelligence tests, in comparison to 
tests likely to be affected by education, such as vocabulary and
knowledge of worldly facts (Flynn, 2007; Rönnlund, Carlstedt, 
Blomstedt, Nilsson, & Weinehall, 2013). In a relatively recent 
investigation, Flynn (2009a) reported ongoing gains (1943 
to 2008) in British children (5.5 to 11 years old) as measured 
by the Raven's Progressive Matrices (Raven, Court, & Raven, 
1986; Raven, Rust, & Squire, 2008). Additionally, Flynn (2009b)
reported continued (1995-2006) IQ increases equal to three IQ
points per decade in adults based on the Wechsler scales. 
Based on an examination of the Seattle Longitudinal Study 
(SLS) database, Schaie, Willis, and Pennak (2005) reported a 
Flynn effect equal to approximately ½ of a standard deviation in
cognitive ability test scores between birth cohort 1931 and 
birth cohort 1952. As the results were most pronounced for 
inductive reasoning, Schaie et al. (2005) recommended that it 
would be insightful to evaluate possible test score changes 
across time in fluid type capacities more basic than inductive 
reasoning. 

Arguably, one such relatively elementary cognitive ability
construct is memory span. Individual differences in memory
span (WMC in particular) are known to be correlated 
substantially with fluid intelligence. Based on a meta- 
analysis, Kane, Hambrick, and Conway (2005) estimated that 
approximately 50% of the true score variance between WMC 
and fluid intelligence is shared. Based on the WAIS-IV normative 
sample, Gignac (2014) suggested that the shared variance may 
be closer to 60%. The substantial empirical association between 
WMC and fluid intelligence is considered an important 
phenomenon, as it has been theorised that WMC is a critical 
determinant, or rate limiting factor, in the performance of fluid 
intelligence tasks (Carpenter, Just, & Shell, 1990; Fry & Hale,
1996). Oberauer, Süß, Wilhelm, and Sander (2007) proposed that
the association between WMC and fluid intelligence is embedded
by the central nervous system in such a way that only a
limited number of bindings can be created to facilitate the 
development of novel relational representations. Consequently, 
given the close correspondence between WMC and fluid 
intelligence on both empirical and theoretical grounds, the 
reported increases in fluid intelligence test scores (Flynn, 
2007) would arguably be expected to be associated with 
concomitant increases in memory span, particularly WMC. 
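
The shared-variance estimates quoted above can be restated as correlations, since the proportion of shared variance is the squared correlation. The following is a minimal sketch of that conversion; the function name is illustrative and does not come from the cited studies.

```python
import math


def correlation_from_shared_variance(shared: float) -> float:
    """Return the correlation implied by a proportion of shared variance (r = sqrt(r^2))."""
    if not 0.0 <= shared <= 1.0:
        raise ValueError("shared variance must lie between 0 and 1")
    return math.sqrt(shared)


# Shared true score variance between WMC and fluid intelligence, as cited in the text
r_kane = correlation_from_shared_variance(0.50)    # Kane et al. (2005): about 50%
r_gignac = correlation_from_shared_variance(0.60)  # Gignac (2014): about 60%

print(f"Implied correlations: {r_kane:.2f} and {r_gignac:.2f}")
```

On these figures, the implied true score correlation between WMC and fluid intelligence lies somewhere between roughly .71 and .77.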

1.2. The case for Digit Span 

One of the most commonly used tests of memory span is 
Digit Span (Blankenship, 1938; Dempster, 1981). According to 
Bronner, Healy, Lowe, and Shimberg (1927), Digit Span was in 
use as early as 1887. Digit Span's popularity was established by
virtue of the fact that it was included in both of the intelligence 
batteries that emerged as the most popular in the early 20th 
century: the Stanford-Binet (Terman, 1917); and the Wechsler- 
Bellevue scale (W-B; Wechsler, 1939). Although there are 
several slight variations of the Digit Span subtest, typically, the 
test consists of administering several series of single digits to be 
recalled in a particular order. In most cases, the number of 
digits within a series ranges from 3 to 9. There are two common 
forms of the Digit Span test: Digit Span Forward (DSF), where
the digits need to be recalled in the order with which they were
presented, and Digit Span Backward (DSB), where the digits
need to be recalled in the reverse order with which they were 
presented. 

Although Digit Span was initially considered a relatively 
poor measure of intellectual functioning (Matarazzo, 1972; 
Wechsler, 1939), such a position appears to be based more 
on presumption and clinical experience, rather than rigorous 
statistical evidence (Bachelder & Denny, 1977; Verive & 
McDaniel, 1996). For example, Wechsler (1939) presumed 
that there was not a sufficient amount of variability in Digit
Span scores to be a high quality discriminator of intelligence, 
as approximately 90% of the adult population appeared to 
recall somewhere between five and eight digits. Additionally, 
Wechsler (1939) claimed that both DSF and DSB correlated 
poorly with other intelligence subtests and contained little of g. 
However, Wechsler's (1939) own reported results do not 
support such a position. First, based on the Wechsler-Bellevue 
(Wechsler, 1939) normative sample (ages: 20-34, N=355), 
Digit Span was associated with a mean inter-subtest correla- 
tion of .38, which is comparable to the mean inter-subtest 
correlation of .44 for the whole battery. Additionally, based on 
the same portion of the normative sample, Wechsler (1939) 
reported the corrected subtest-FSIQ correlation (a reasonable
proxy of a g component loading) associated with Digit Span at
.51, which, arguably, was not substantially smaller than the
average corrected subtest-FSIQ correlation of .61. More recently,
based on the Wechsler Adult Intelligence Scale - IV (WAIS-IV; 
Wechsler, 2008) normative sample (N= 2,200) and a bifactor 
model, Gignac (2014) found that DSF and DSB were associated 
with g loadings of .46 and .58, respectively, which would suggest 
that both subtests are moderate indicators of g. Disattenuated 
for imperfect reliability in subtest scores, the corrected g 
loadings corresponded to .51 and .64, respectively. Jensen and 
Figueroa (1975) also found that DSB correlated more signifi- 
cantly with g than DSF. Thus, although Digit Span is 
certainly not an excellent indicator of g, it is arguably a 
fair to good indicator of intellectual functioning, particu- 
larly DSB. 
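
The step from observed to disattenuated g loadings can be sketched as follows, assuming the conventional correction in which a loading is divided by the square root of the subtest's reliability. The subtest reliabilities are not stated in the text, so the sketch backs them out from the reported observed and corrected loadings rather than supplying them; both function names are illustrative.

```python
import math


def disattenuate(loading: float, reliability: float) -> float:
    """Correct a loading for unreliability in the subtest: loading / sqrt(r_xx)."""
    return loading / math.sqrt(reliability)


def implied_reliability(observed: float, corrected: float) -> float:
    """Reliability implied by an observed/corrected loading pair (assumes the formula above)."""
    return (observed / corrected) ** 2


# Observed and corrected g loadings as reported for DSF and DSB (Gignac, 2014)
rel_dsf = implied_reliability(0.46, 0.51)
rel_dsb = implied_reliability(0.58, 0.64)

print(f"Implied reliabilities: DSF = {rel_dsf:.2f}, DSB = {rel_dsb:.2f}")
print(f"Round trip: {disattenuate(0.46, rel_dsf):.2f}, {disattenuate(0.58, rel_dsb):.2f}")
```

Under that assumption, the reported pairs are consistent with subtest reliabilities of a little over .80 for both DSF and DSB.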

Digit Span has also been observed to share variance with a
number of socially important variables. For example, Frank 
(1983) reviewed four studies (seven independent samples) 
which examined the association between the Wechsler sub- 
scales and grade point average. Digit Span was associated with a 
mean validity coefficient of .35, which was very comparable to
the mean validity coefficient of .37 across all 11 subtests. Digit 
Span has also been found to correlate with years of education 
completed (r = .44, Paul et al., 2005; r = .43, Birren & Morrison, 
1961), reading comprehension (r = .30; Daneman & Merikle, 
1996; Norman, Kemper, & Kynette, 1992), and brain volume 
(r = .41; Wickett, Vernon, & Lee, 2000). Additionally, amongst a
battery of cognitive ability tests, Digit Span was found to be the
best predictor of academic achievement amongst learning- 
problem children (Sewer, Shapiro, & Shapiro, 1972). Digit 
Span has also been found to be a respectable predictor of job 
performance (medium cognitive demands: r = .51; Verive &
McDaniel, 1996). Finally, Miller and Vernon (1992) found that 
the association between reaction time and g was mediated 
by individual differences in short-term memory span. Thus, 
in light of the above, it is likely tenable to suggest that Digit 
Span is somewhere between a moderate to good indicator of 
intellectual functioning on both empirical and theoretical 
grounds (Bachelder & Denny, 1977). 

Although there is some evidence to suggest otherwise
(e.g., Colom, Flores-Mendoza, Quiroga, & Privado, 2005), it 
is relatively widely recognised that DSB is a better measure 
of working memory capacity (WMC) than DSF (Hedden & 
Gabrieli, 2004; Oberauer, Süß, Schulze, Wilhelm, & Wittmann,
2000). Working memory (WM) is considered different from
short-term memory (STM) in that WM requires the maintenance
and the manipulation (or transformation) of information
temporarily during cognitive activity (Baddeley & Hitch, 1974; 
Baddeley, 2002; Oberauer et al., 2000). STM, by contrast, is 
considered to require only the maintenance of objects in 
memory. Theoretically, DSB is considered a measure of WMC,
as the requirement to mentally re-order the digits is considered 
a form of cognitive manipulation (Oberauer et al., 2000). 
Empirically, DSB has also been found to correlate more 
substantially with other measures of WMC, in comparison to 
DSF (Redick & Lindsey, 2013). Consequently, it was hypothesised
that STMC and WMC, as measured by DSF and DSB,
respectively, would evidence substantial increases across time, 
in line with the Flynn effect observed for fluid intelligence 
measures such as Raven's (Flynn, 2007). Furthermore, as DSB is 
a better indicator of WMC, it was hypothesised that the Flynn 
effect would be more substantial for DSB than DSF. 

1.3. What is the Average Memory Span? 

In a classic paper, Miller (1956) contended that the mean
maximal number of serially processed objects a healthy adult 
can store in STM is seven, plus or minus two. Arguably, Miller's 
(1956) assertion was based on a relatively small amount 
of empirical research and a liberal amount of speculation, 
rather than a comprehensive quantitative review of the STMC 
research. Despite this, Miller's (1956) magical number seven
(plus or minus two) continues to be widely recognised 
(Goldstein, 2010). Of course, there are some critics of Miller's 
law, with authors arguing that the values seven, plus or minus two, are
too high or too low (Dehn, 2008). Although a substantial 
amount of STMC empirical research has accumulated since 
Miller (1956), much of this research has been based on 
relatively small (N <25), non-representative samples 
(i.e., university students), and somewhat different tasks and 
scoring protocols. Consequently, a meta-analysis does not seem 
feasible in order to estimate, precisely, the mean STMC of 
healthy adults. However, as described above, one important 
exception is the Digit Span test, which has been used in the area 
of intellectual assessment for over a century (Bolton, 1892; 
Wechsler, 2008). 

In contrast to STMC, WMC tests are widely considered to be 
more difficult, as they require the maintenance of information in 
STM, as well as the simultaneous manipulation (or transforma- 
tion) of that information (Oberauer et al., 2000). Along the lines 
of Miller's (1956) magical number seven, Cowan (2005, 2010)
proposed that the mean maximal WMC for a healthy adult is 
four objects, plus or minus one. Based on a series of experiments 
with novices and experts at chess, Gobet and Clarkson (2004) 
argued that Cowan's (2010) magical number four was an
overestimate by one object, as the natural WMC of healthy 
adults appeared to be closer to three objects. Although 
insightful, Gobet and Clarkson's (2004) study was based on a
sample of twelve individuals, which would arguably not be
considered sufficiently large to publish firm statements about 
the mean level of WMC in the broader adult population. In fact, 
much of the WMC research suffers from the same limitation 
as that identified for STMC: small, unrepresentative samples. 
Fortunately, however, Digit Span is often administered with the
inclusion of both forward (i.e., DSF) and backward (i.e., DSB)
span items. Consequently, as Digit Span has been administered 
within a relatively large number of high quality normative 
samples (e.g., Wechsler scales and others), there was the 
opportunity to estimate very precisely the typical (i.e., mean) 
verbal STMC and verbal WMC of healthy adults, which was a 
secondary purpose of this investigation. 

In addition to the typical STM and WM capacities of adults, 
it was considered useful to estimate the amount of variability 
in STMC and WMC within the adult population. Miller's law 
(i.e., 7 ± 2) suggests that approximately 95% of individuals' 
STMC lie somewhere between five and nine objects, which 
implies that STMC is associated with a standard deviation of 
approximately 1 (i.e., 1 * ±1.96).¹ Cowan's law (4 ±1) suggests
that approximately 95% of individuals' WMC lie somewhere 
between three and five objects, which implies that WMC is 
associated with a standard deviation of approximately .50 
(.50 * ±1.96). From a coefficient of variation perspective 
(SD / M), Miller's law and Cowan's law imply that STMC and 
WMC are associated with standardized variability estimates of 
.14 (1 / 7) and .13 (.50 / 4), respectively. Arguably, based on 
these values, the amount of STMC and WMC variability may 
be considered rather low, in comparison to other cognitive 
capacities. For example, based on the WAIS-IV normative 
sample means and standard deviations reported in Beaujean 
and Sheng (2014), the mean coefficient of variation associated 
with nine of the WAIS-IV subtests (45-54 year olds; N = 200) 
was calculated by me to be .27 (range: .21 to .36). Additionally, 
based on the normative sample means and standard deviations 
(18-30 year olds) associated with the BIRT Memory and
Information Processing Battery (BIMPB; Oddy, Coughlan, & 
Crawford, 2007) reported in Baxendale (2010), list recall and 
design recall were calculated by me to be associated with 
coefficients of variation of .22 and .24, respectively. Conse- 
quently, Miller's law and Cowan's law imply substantially less 
variability in STMC and WMC than other cognitive capacities. 
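
The arithmetic behind these implied values can be sketched as follows. The sketch assumes, as the text does, that the "plus or minus" range is a 95% interval under a normal distribution; the function names are illustrative only.

```python
Z_95 = 1.96  # two-sided 95% standard normal deviate


def implied_sd(half_width: float, z: float = Z_95) -> float:
    """Standard deviation implied by a 95% range of mean +/- half_width."""
    return half_width / z


def coefficient_of_variation(sd: float, mean: float) -> float:
    """Standardized variability: SD divided by the mean."""
    return sd / mean


# Implied standard deviations before rounding
miller_sd = implied_sd(2.0)  # Miller's law: 7 +/- 2
cowan_sd = implied_sd(1.0)   # Cowan's law: 4 +/- 1

# Coefficients of variation using the rounded SDs adopted in the text (1 and .50)
miller_cv = coefficient_of_variation(1.0, 7.0)
cowan_cv = coefficient_of_variation(0.50, 4.0)

print(f"Implied SDs: {miller_sd:.2f} and {cowan_sd:.2f}")
print(f"Coefficients of variation: {miller_cv:.3f} and {cowan_cv:.3f}")
```

The unrounded standard deviations (about 1.02 and 0.51) are close enough to 1 and .50 that the rounded values used in the text lose little.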

It will be noted, however, that memory span has long been 
suggested to be associated with relatively little variability in 
human capacity (Sattler, 1982; Wechsler, 1939). Furthermore, 
the lack of variability has been articulated to be a reason to 
consider measures such as Digit Span to be relatively weak 
indicators of intellectual functioning (Matarazzo, 1972). How- 
ever, the contention that STMC and WMC are associated with 
relatively low levels of variability does not appear to yet have 
been tested specifically across a number of normative samples 
and a standardized representation of variability (i.e., coefficient 
of variation). Thus, a secondary purpose of this investigation 
was to estimate precisely the verbal STMC and verbal WMC 
means, standard deviations, and coefficients of variation in the
healthy adult population, so as to verify the values proposed by
Miller (1956; 7 ±2) and Cowan (2005; 4 ±1).

¹ The value of 1.96 corresponds to 95% of the standard normal distribution.
Multiplying the standard deviation by the 95% standard normal deviate
(i.e., 1.96) is presumed to correspond to the ±2 value associated with Miller's
law. In this case, a standard deviation of one corresponds to a ± value of very
nearly 2.



1.4. Flynn Effect and Memory Span: previous research 

Investigations which have examined the possibility of 
cognitive ability test score increases across time tend to have 
done so at the aggregate level. For example, Parker (1986) 
examined FSIQ differences across the W-B, the WAIS, and the 
WAIS-R, but did not report results at the subscale level. 
Similarly, Flynn's (1987) comprehensive investigation was 
based principally upon total scale scores (FSIQ, VIQ, PIQ). In 
order to help understand more fully the nature of the Flynn 
effect, recent research has focussed upon the examination of 
test score changes at the subscale level (Flynn, 2007). For 
example, Beaujean and Sheng (2014) examined mean level test 
score differences across the WAIS, WAIS-R, WAIS-III, and WAIS- 
IV at the subscale level. As the raw data were not available, 
Beaujean and Sheng identified the subscale raw score means that
corresponded to a scaled score of 10 for each subtest to
determine whether scores increased across editions/time.
Beaujean and Sheng reported substantial Digit Span subtest mean
increases across time. On the surface, the procedure used by
Beaujean and Sheng may seem valid. However, such a methodology
would only be valid if the number of items, as well as scoring 
procedure, remained constant across editions. In fact, there are 
a large number of changes in the number of items within 
subtests and scoring procedures across Wechsler editions, 
many of which compromise the interpretation of a substantial 
amount of published Flynn effect research (Kaufman, 2010; but 
see also Flynn, 2010). 

In comparison to other Wechsler subtests, there have 
been relatively few changes to the Digit Span subtest over the 
years. There are two significant changes that are important 
to consider, however. First, the WAIS-R (and later editions) 
awarded up to a maximum of two points for recalling correctly
both trials associated with a Digit Span item. By contrast, within 
the W-B and the WAIS, the DSF and DSB scores simply reflect 
the largest digit series recalled correctly. Thus, it would 
naturally be expected that the WAIS and the W-B Digit Span 
raw scores would be lower than those observed in later 
Wechsler editions. In fact, the Digit Span Total (DST) raw score 
mean that corresponds to a scaled score of 10 within the WAIS 
is 11 and increased to 15-16 in the WAIS-R (ages 20-24 years). 
The difference of four to five points may simply reflect the 
change in scoring, not necessarily a change in memory span 
ability. Secondly, a new Digit Span subtest was added to the 
WAIS-IV (Digit Span Sequencing, DSS), and the scores associ- 
ated with DST were based on the sum of DSF, DSB, and DSS. 
Thus, DST within the WAIS-IV is based on the sum of three 
subtests, rather than two subtests. Naturally, the DST raw
score mean that corresponds to a scaled score of 10 increased
from 17-18 in the WAIS-III to 28-29 in the WAIS-IV (ages
20-24 years). Again, such an increase would not necessarily 
reflect an increase in ability across time, but, instead, the 
change in the scoring. Fortunately, the WAIS-R, WAIS-III, and
WAIS-IV reported additional tables in their technical manuals
that include the 'longest digit span forward' and 'longest digit
span backward' means and standard deviations across all
age groups. These values facilitate valid comparisons across
Wechsler editions, as well as other publications which used a 
comparable Digit Span scale, as will be described further below. 
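
The effect of the scoring change can be illustrated with a short sketch that scores one response protocol both ways. It assumes two trials per series length and treats a series as recalled if either trial is correct; the protocol is hypothetical, and discontinue rules are omitted.

```python
from typing import List, Tuple

# Each item: (series length, trial 1 correct?, trial 2 correct?)
Item = Tuple[int, bool, bool]


def longest_span_score(items: List[Item]) -> int:
    """Longest series recalled correctly on at least one trial (W-B/WAIS-style score)."""
    passed = [length for length, t1, t2 in items if t1 or t2]
    return max(passed, default=0)


def per_trial_score(items: List[Item]) -> int:
    """One point per correctly recalled trial, up to two per item (WAIS-R-style score)."""
    return sum(int(t1) + int(t2) for _, t1, t2 in items)


# Hypothetical forward-span protocol for a single examinee
forward = [
    (3, True, True),
    (4, True, True),
    (5, True, True),
    (6, True, False),
    (7, False, False),
]

print(longest_span_score(forward), per_trial_score(forward))
```

The same performance yields a different raw score under each scheme, which is why raw score means cannot be compared across editions that score differently, whereas the longest-span values can.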

In addition to Beaujean and Sheng (2014), Daley, Whaley, 
Sigman, Espinosa, and Neumann (2003) reported a Digit Span 
Forward test score increase equal to a Cohen's d = -.19, based
on two samples of Kenyan children tested between 1984 
(N=118) and 1998 (N=537). Much larger differences were 
observed for Raven's and a vocabulary test. It will be noted that 
Daley et al. also reported substantial reductions in test score
standard deviations across time (25%-30% smaller), which 
suggests that the changes in test scores may have been due 
principally to improvements at the lower end of the distribu- 
tion. Unfortunately, although the 1984 sample was somewhat 
normative in nature, it was rather small in size. Furthermore, 
the second sample was essentially a convenience sample, 
which makes valid interpretations of the comparisons difficult. 
Finally, as the samples were based on children, the test 
score changes could have arisen due to changes in the rate of 
maturational development in children across time. 
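
For reference, a standardized mean difference of this kind can be computed as sketched below. The sample sizes are those given above, but the means and standard deviations are hypothetical placeholders, not values from Daley et al. The sketch takes the earlier sample minus the later sample, so that an increase across time comes out negative; that sign convention is an assumption.

```python
import math


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent samples."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1: float, sd1: float, n1: int, m2: float, sd2: float, n2: int) -> float:
    """Cohen's d for sample 1 minus sample 2, scaled by the pooled SD."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


# Hypothetical means and SDs; only the Ns (118 and 537) come from the text
d = cohens_d(m1=5.0, sd1=1.2, n1=118, m2=5.2, sd2=0.9, n2=537)
print(f"d = {d:.2f}")
```

Because the pooled standard deviation weights each sample by its size, a shrinking standard deviation in the larger, later sample also affects the size of d.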

In addition to the small number of Flynn effect investiga- 
tions relevant specifically to Digit Span, a small number of 
Flynn effect studies have included other measures of memory
span. For example, based on the normative samples associated
with the Adult Memory and Information Processing Battery
(AMIPB; Coughlan & Hollowes, 1985; Oddy et al., 2007), 
Baxendale (2010) found virtually no mean differences across 
groups on the list recall task. However, a mean increase across 
the two normative samples of approximately half of one 
standard deviation was observed for the design recall task. 
Baxendale (2010) offered little in the way of explanation for 
why the effect was observed for spatial but not verbal recall,
except to suggest that the two processes are not perfectly
correlated. It is probably important to note that a large number
of the items within the AMIPB changed from the 1985 to 2007
editions (Oddy et al., 2007). 

In another investigation, Rönnlund and Nilsson (2008)
found that an episodic memory latent variable mean increased
by .60 of a z-score from 1988-1990 to 2003-2005, which
would be suggestive of a Flynn effect. However, an arguably
distinct limitation associated with Rönnlund and Nilsson is that the
individuals selected to participate in the investigation were
drawn exclusively from a single regional town in Sweden
(Umeå, population: 110,000). Also, the mean age of the
samples included in Rönnlund and Nilsson was relatively old, as no
participants under the age of 35 years were included. Thus, the
results observed in Rönnlund and Nilsson may be due to the overall
health improvements reported in the elderly over the years
(Jeune & Brønnum-Hansen, 2008).

Consequently, in light of the above, the purpose of this 
investigation was twofold: (1) to estimate across a combina- 
tion of normative samples the typical verbal STM and WM 
capacities of healthy adults; and (2) to test the hypothesis that 
the verbal STM and the verbal WM capacities of adults have 
increased across time in line with the Flynn effect. 

2. Method 

2.1. Samples and measure 

Table 1
Means and standard deviations associated with Longest Digit Span Forward (LDSF) and Longest Digit Span Backward (LDSB) across time.

Source              Year   N       Ages     LDSF M (SD)    LDSB M (SD)    DST
Wells and Martin    1923   50      Adults   6.3 (NA)       5.1 (NA)       11.40
Wechsler            1933   236     Adults   6.60 (1.13)    NA             NA
Weisenburg et al.   1936   70      18-59    6.69 (1.02)    4.87 (1.16)    11.56
W-B                 1939   1,081   17-70    NA             NA             12.00
WMS                 1945   96      20-49    6.53 (1.17)    4.80 (1.12)    11.23
WAIS                1955   1,785   16-75    NA             NA             11.00
WAIS-R              1981   1,880   16-74    6.45 (1.33)    4.87 (1.43)    11.32
MAS                 1991   845     18-90    6.63 (1.22)    4.83 (1.30)    11.46
WAIS-III            1997   2,000   16-74    6.59 (1.35)    4.85 (1.49)    11.44
WAIS-IV             2008   1,900   16-74    6.72 (1.31)    4.84 (1.39)    11.56
N-weighted M                                6.56 (1.22)    4.88 (1.32)    11.44

Note. Wells and Martin (1923) created a normative sample group for the purposes of studying psychopathology; Wechsler (1933) published normative Digit Span
Forward data to compare the variability associated with a large number of human characteristics; Weisenburg et al. (1936) created a normative sample group for the
purposes of studying aphasia; DST = Digit Span Total; W-B = Wechsler-Bellevue; WMS = Wechsler Memory Scale; WAIS = Wechsler Adult Intelligence Scale;
MAS = Memory Assessment Scales; NA = not available.

In order to address the two principal questions posed in this
investigation, the results associated with several publications
(journal articles, books, and technical manuals) were compiled.
Across all selected publications, a largely identical Digit Span 
test was administered. Specifically, the nature of the Digit Span 
test considered for inclusion in this investigation consisted of a 
series of digits read to the participant orally at a rate of one digit 
every second. The participant had to repeat the digits orally. In 
the case of DSF, the digits had to be repeated in the order with 
which they were read. In the case of DSB, the digits had to be 
repeated in the reverse order with which they were read. DSF 
and DSB are typically recognised as measures of verbal STMC 
and verbal WMC, respectively (Oberauer et al., 2000). In all 
cases, the means included in this investigation corresponded to 
the largest series of digits recalled correctly. In almost all cases, 
the number of digits within a series ranged from three to nine 
for DSF and two to eight for DSB. In most cases, the means and 
standard deviations associated with DSF, DSB, and DST were 
available. However, in some cases, only the results for DSF or 
DST were available. 

The sources/samples included in this investigation are 
listed in Table 1. It can be observed that the Digit Span 
normative sample results associated with the Wechsler- 
Bellevue (W-B; Wechsler, 1939)², the Wechsler Adult Intelligence
Scale (WAIS; Wechsler, 1955), the Wechsler Adult
Intelligence Scale - Revised (WAIS-R; Wechsler, 1981), the 
Wechsler Adult Intelligence Scale - III (WAIS-III; Wechsler, 
1997) and the WAIS-IV (Wechsler, 2008) were included in the 
analysis. With respect to the W-B and the WAIS, the means and 
standard deviations associated with DSF and DSB were not 
reported. However, the raw score DST values (DSF + DSB)
which corresponded to a scaled score of 10 (i.e., the scaled 
mean) were published and included in this investigation. In 
the cases of the W-B and the WAIS, the raw score that 
corresponded to a scaled score of 10 was considered appropri- 
ate for inclusion in this investigation, as the DSF score and the 
DSB score corresponded to the number of digits in the longest 
series recalled accurately (Wechsler, 1939, 1955). Furthermore,
the DSB and the DSF scores were added together to form
the DST score. With respect to the WAIS-R, WAIS-III, and the
WAIS-IV, the 'longest digit span forward' (LDSF) and the
'longest digit span backward' (LDSB) means and standard
deviations were reported in supplemental tables within their
respective technical manuals.³ Thus, LDSF was considered a
comparable estimate of DSF and LDSB was considered a
comparable estimate of DSB. Furthermore, LDSF added to
LDSB was considered an estimate of DST. In order to increase
the comparability of the WAIS-R, WAIS-III, and WAIS-IV
normative sample scores with the other sources included in
this investigation (all of which did not include very old
participants), I calculated the N-weighted means based on
the LDSF and LDSB values associated with the 16 to 74 year
old age groups, instead of simply using the total sample (16
to 90 years) normative sample LDSF and LDSB means and
standard deviations.

² The Wechsler-Bellevue (Wechsler, 1939) was normed on a total sample of
1750 subjects; however, 670 of those were children as young as 7 years. The
adult portion of the normative sample amounted to 1071 participants
(Wechsler, 1958, p. 87).
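
The aggregation across age groups described above can be sketched as follows. The age-band values are hypothetical placeholders, not figures from the technical manuals, and the way the standard deviations are combined (within-band plus between-band variation) is an assumption, as the text specifies only that the means were N-weighted.

```python
import math
from typing import List, Tuple

# One entry per age band: (n, mean, sd). Hypothetical values for illustration only.
Band = Tuple[int, float, float]


def n_weighted_mean(bands: List[Band]) -> float:
    """Mean of the merged sample: each band mean weighted by its sample size."""
    total_n = sum(n for n, _, _ in bands)
    return sum(n * m for n, m, _ in bands) / total_n


def combined_sd(bands: List[Band]) -> float:
    """SD of the merged sample, combining within-band and between-band variation."""
    total_n = sum(n for n, _, _ in bands)
    grand = n_weighted_mean(bands)
    ss = sum((n - 1) * sd**2 + n * (m - grand) ** 2 for n, m, sd in bands)
    return math.sqrt(ss / (total_n - 1))


bands = [(200, 6.8, 1.3), (200, 6.6, 1.3), (100, 6.2, 1.4)]
mean_all = n_weighted_mean(bands)
sd_all = combined_sd(bands)
print(f"M = {mean_all:.2f}, SD = {sd_all:.2f}")
```

Restricting the list of bands to those covering 16 to 74 years before calling these functions corresponds to the age restriction described in the text.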

The DSF and DSB results associated with the Wechsler 
Memory Scale (WMS; Wechsler, 1945) normative sample were 
included in this investigation⁴, as the Digit Span subtest was
essentially identical to that included in the WAIS (Wechsler,
1955). However, the normative sample results associated with
the WMS-R (Wechsler, 1987) were excluded, because the
WMS-R normative sample was not widely age representative.
Specifically, the norms associated with the WMS-R used
interpolated values for several age groups from 18 to 45 years
of age (Elwood, 1991). The WMS-III (Wechsler, 1997) Digit
Span norms were also excluded, as they were identical to those 
associated with the WAIS-III (Wechsler, 1997). Finally, Digit 
Span was not included in the WMS-IV (Wechsler, 2008). In 
light of the above, with respect to the Wechsler Memory Scales, 
only the results associated with the WMS (Wechsler, 1945) 
were included in this investigation.

³ The LDSF and LDSB means and standard deviations associated with the
WAIS-R normative sample were reported in the WAIS-R NI technical manual
(Kaplan, Fein, Morris, & Delis, 1991).

⁴ The Wechsler Memory Scale (Wechsler, 1945) was normed on a sample of
200 healthy adults (ages 25 to 50); however, the DSF and DSB raw score means
and standard deviations were reported for only 96 of the adults.

With respect to non-Wechsler scales, the Digit Span
normative sample results associated with the Memory Assessment
Scales (MAS; Williams, 1991) were included, as the MAS
Digit Span test is essentially identical to the Digit Span test
included in the WAIS-R (Wechsler, 1981). Although the MAS
technical manual does not include the raw score means and
standard deviations for DSF and DSB, they were supplied to me 
via email (M. Williams, personal communication, June 10, 
2014). Less well-known are the three oldest sources included 
in this investigation. Weisenburg, Roe, and McBride (1936) 
created a normative sample group for the purposes of studying 
and diagnosing aphasia. To this effect, a control group of 70 
adults were selected from three hospitals in Pennsylvania. 
Although the participants included in the normative sample 
were admitted to hospital, individuals who were suffering 
from any psychological condition were excluded. A battery of 
intelligence tests was administered to the participants, including
a Digit Span test. The items ranged from five to eight digits
for DSF and three to seven digits for DSB. Weisenburg et al. 
reported the means and standard deviations separately for DSF 
and DSB. Next, Wechsler (1933) published a normative sample 
mean and standard deviation associated with DSF. Although 
it is impossible to be certain, it would seem reasonable to 
presume that the version of Digit Span used by Wechsler 
(1933) was the same as that which made its way into the well-known
Wechsler scales. Finally, Wells and Martin (1923)
created a normative sample group for the purposes of studying 
psychopathology. Several tests were administered to the 
normative sample group, including Digit Span. The DSF portion 
of the test consisted of series of digits ranging up to nine digits, 
and the DSB portion of the test consisted of series of digits 
ranging up to eight digits. 

A number of ostensibly useful sources of data were 
excluded, as they were judged not to have administered a 
sufficiently similar Digit Span test, or the results were not 
reported in a comparable manner. Also, some sources were not 
based on a sample sufficiently representative to be considered 
reasonably normative. Notable exclusions were Russell's (1975, 
1988) revision of the WMS, the Stanford-Binet (S-B) intelligence
batteries (Terman, 1917; Terman & Childs, 1912), as well
as Start (1924), Brener (1940) and Elwood (2001). Thus, based
on the information included in Table 1, it can be seen that there
were 10 normative sample sources included in this investigation
across 85 years (1923 to 2008). The DSF, DSB, and
DST sample sizes corresponded to 7077, 6841, and 9770, 
respectively. 

3. Results 

As can be seen in Table 1, the N-weighted DSF, DSB, and DST
means (and SDs) corresponded to 6.56 (1.22), 4.88 (1.32), and 
11.44 (NA), respectively. In order to estimate the 95% lower 
and upper bounds associated with the DSF and DSB distribu- 
tions, the DSF and DSB standard deviations were multiplied by 
the standard normal deviate (i.e., 1.96). In the case of DSF, the 
deviation term corresponded to 2.39 (i.e., 1.22 * 1.96). Thus, it 
may be suggested that 95% of the adult population has a verbal 
STMC equal to somewhere between 4.17 and 8.95 objects. In 
the case of DSB, the deviation term corresponded to 2.58 
(i.e., 1.32 * 1.96). Thus, it may be suggested that 95% of the adult
population has a verbal WMC equal to somewhere between 
2.30 and 7.46 objects. 
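
The arithmetic behind these bounds can be sketched as follows. This is a minimal illustration in Python, assuming normally distributed scores; the function name `normal_bounds` is hypothetical and not part of the original analysis. Note that the rounded DSB standard deviation of 1.32 gives a deviation term of about 2.59 rather than the reported 2.58, presumably because the reported value was computed from an unrounded standard deviation.

```python
Z_95 = 1.96  # standard normal deviate for a two-sided 95% interval


def normal_bounds(mean, sd, z=Z_95):
    """Return (deviation term, lower bound, upper bound) under normality."""
    deviation = sd * z
    return deviation, mean - deviation, mean + deviation


# N-weighted means and SDs as reported in the text (rounded values).
for label, mean, sd in [("DSF", 6.56, 1.22), ("DSB", 4.88, 1.32)]:
    deviation, lower, upper = normal_bounds(mean, sd)
    print(f"{label}: +/-{deviation:.2f} -> [{lower:.2f}, {upper:.2f}]")
```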

It can also be observed in Table 1 that there was very little 
variability in the means across time. The DSF, DSB, and DST 
ranges corresponded to 6.30-6.72, 4.80-5.10, and 11.00-12.00, 
respectively. To test the hypothesis that memory span scores
increased across time, three Pearson correlations were estimated
between year and the three memory span scores (p values
estimated via permutation tests). None of the correlations were
statistically significant: DSF r = .45, p = .270; DSB r = -.57,
p = .124; DST r = -.06, p = .880. Thus, as the estimated 
correlations were non-significant and differentially directed, the 
hypothesis that memory span scores would evidence mean 
level increases across time was unsupported (see Fig. 1). 
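
A permutation test of a Pearson correlation between year and mean span can be sketched as below. This is an illustration only: the two-sided test, the 10,000 permutations, the add-one correction, and the function names are assumptions, as the exact procedure is not described here. The data values are hypothetical and are not the Table 1 values.

```python
import random


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p value for the Pearson correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any pairing between x and y
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical publication years and mean spans (NOT the Table 1 values).
years = [1925, 1940, 1955, 1970, 1985, 2000]
means = [6.5, 6.6, 6.4, 6.7, 6.5, 6.6]
r = pearson_r(years, means)
p = permutation_p(years, means)
print(f"r = {r:.2f}, p = {p:.3f}")
```

With only a handful of normative samples, a permutation test avoids relying on the large-sample distribution of r, which is one plausible reason for preferring it in this setting.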

Finally, the possibility of ceiling effects associated with the 
Digit Span subscale scores was also examined. As the DSF 
mean of 6.56 was approximately two standard deviations less 
than the maximum possible score of 9, and the DSB mean of 
4.88 was approximately 2.5 standard deviations less than the 
maximum possible score of 8, it was considered unlikely that 
the DSF and DSB subtest scores suffered from substantial 
ceiling effects. In fact, with respect to the highest performing 
normative group across the WAIS-R, WAIS-III, and WAIS-IV
(20-24 years of age), the percentage of participants who scored 
the maximum DSF score (i.e., 9 digits) was equal to 9.5%, 7.0%, 
and 11.0%, respectively. With respect to DSB, 8.5%, 7.0%, and 
3.5% of the participants within the WAIS-R, WAIS-III, and WAIS-IV
normative samples achieved the highest score (8 digits),
respectively. Thus, although there was a small ceiling effect in 
the data, it was neither substantial, nor was there an increasing 
trend across time, supporting further the absence of a Flynn 
effect. 
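
The distance between each mean and its test ceiling, expressed in standard deviation units, can be checked with a short sketch. The function name `ceiling_distance` is hypothetical. With the rounded values reported in the text, the DSB distance works out closer to 2.4 than 2.5.

```python
def ceiling_distance(mean, sd, max_score):
    """Distance between the mean and the test ceiling, in SD units."""
    return (max_score - mean) / sd


# Reported N-weighted means and SDs, with maximum span lengths of 9 and 8.
for label, mean, sd, max_score in [("DSF", 6.56, 1.22, 9), ("DSB", 4.88, 1.32, 8)]:
    distance = ceiling_distance(mean, sd, max_score)
    print(f"{label}: {distance:.2f} SDs below the ceiling")
```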

4. Discussion 

This investigation had two purposes: (1) to estimate 
precisely the typical verbal STMC and verbal WMC of adults, 
and (2) to determine whether these capacities have increased 
across time, in line with the Flynn effect. The results of this
investigation suggest that the typical adult has a verbal STMC of 
6.56 objects (plus or minus 2.39), and a verbal WMC of 4.88 
objects (plus or minus 2.58). Secondly, in contrast to fluid 
intelligence test scores, STMC and WMC test scores do not 
appear to have increased across time. 

Miller's (1956) proposal that the typical STMC of an adult is
approximately seven objects was largely supported in this
investigation, as the mean DSF score was estimated at 6.56, that is, somewhere
between six and seven objects. However, if Miller's law (7 ± 2) 
implies that approximately 95% of individuals' STMC fall 
somewhere between five and nine, it would imply that STMC 
was associated with a standard deviation of approximately 1 
(i.e., 1 * 1.96). The results of this investigation suggest that the 
standard deviation is only somewhat larger at 1.22, which 
implies that 95% of the population's STMC lies somewhere 
between 2.39 (i.e., 1.22 * 1.96) above and below the mean 
of 6.56 objects. Thus, in rounded terms, Miller's law may be
considered largely accurate, at least in the context of verbal 
STMC. 
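
The step from a "plus or minus" range to an implied standard deviation can be sketched as follows, assuming the range is read as a 95% interval under normality. The function name `implied_sd` is hypothetical.

```python
Z_95 = 1.96  # standard normal deviate for a two-sided 95% interval


def implied_sd(half_width, z=Z_95):
    """SD implied by a 'plus or minus' range read as a 95% interval."""
    return half_width / z


miller_sd = implied_sd(2)   # Miller's 7 +/- 2
observed_sd = 1.22          # N-weighted DSF SD reported in this investigation
print(f"implied SD = {miller_sd:.2f}, observed SD = {observed_sd:.2f}")
print(f"observed 95% deviation term = {observed_sd * Z_95:.2f}")
```

The implied value of roughly 1.02 is what the text rounds to a standard deviation of approximately 1.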

From a coefficient of variation perspective (SD / M), STMC 
was associated with a value of .19, which, although on the lower 
side, is roughly comparable to other cognitive capacities. For 
example, based on the WAIS-IV normative sample means and 
standard deviations reported in Beaujean and Sheng (2014), I
calculated the mean coefficient of variation associated with nine 
of the WAIS-IV subtests (45-54 year olds; N = 200) to be .27 
(range: .21 to .36). Additionally, based on the normative sample
means and standard deviations (18-30 year olds) associated
with the BMIPB (Oddy et al., 2007) reported in Baxendale
(2010), list recall and design recall were calculated by me to
be associated with coefficients of variation of .22 and .24,
respectively. Strictly speaking, Miller's law implies a coefficient
of variation of .14 (1 / 7), which may be suggested to be rather
low, in comparison to the results reported in this investigation
and other cognitive capacities. Thus, the somewhat larger
estimate of variability in STMC reported in this investigation,
in comparison to that implied by Miller (1956), helps bring
STMC closer in line with other cognitive capacities.

Fig. 1. Scatter plot of Digit Span Total, Digit Span Forward, and Digit Span Backward means across time (1923-2008).
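
The coefficients of variation discussed in this paragraph can be reproduced with a short sketch. It uses the rounded mean and standard deviation reported in the text; the function name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation: standard deviation relative to the mean."""
    return sd / mean


cv_dsf = coefficient_of_variation(1.22, 6.56)    # observed DSF values
cv_miller = coefficient_of_variation(1.0, 7.0)   # implied by Miller's 7 +/- 2
print(f"DSF CV = {cv_dsf:.2f}, Miller-implied CV = {cv_miller:.2f}")
```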

Cowan's (2005) proposal of a typical WMC of four objects 
appears to be an underestimate by approximately one object, as 
this investigation estimated a DSB mean of 4.88, or five rounded. 
As per STMC, it would be useful to replicate the estimate of 
five objects on large, representative samples and a diversity of 
measures (spatial, non-numeric, etc.). Even more so than Miller 
(1956), Cowan (2005) appears to have underestimated the 
amount of variability in WMC in the adult population, as DSB 
was associated with a standard deviation of 1.32 (95% normal 
deviation term = 2.58, or three rounded), rather than the
standard deviation of .50 implied by Cowan's (2005) proposal of
plus or minus one object (i.e., .50 * 1.96). Thus, in light of the
results of this investigation, Cowan's law of WMC may be more
accurately restated as 5 ± 3. Arguably, this is a relatively 
substantial re-statement; again, one which should be verified on 
a diversity of WMC measures. 

From a coefficient of variation perspective (i.e., SD / M), it 
would appear that WMC is a cognitive process associated with 
substantially more variability than STMC. In fact, DSB was 
associated with a 42% larger coefficient of variation than DSF 
(.19 vs. .27). Superficially, it may be suggested that greater 
variability may be expected for DSB, as DSB is a more difficult
test than DSF. Based on the WAIS-IV results reported in Table C.4
of the technical manual (Wechsler, 2008), the mean item
difficulties associated with DSF and DSB were calculated by me 
to be p = .70 and p = .53, respectively. Thus, DSB does appear 
to be somewhat more difficult from a pure psychometric 
perspective. Theoretically, WMC involves the application of 
two principal cognitive processes, encoding and transforma- 
tion, rather than simply encoding (Oberauer, Lewandowsky, 
Farrell, Jarrold, & Greaves, 2012). Consequently, it may be 
suggested that the greater amount of variability associated 
with DSB implies that individual differences in the capacity to 
perform both processes (encoding and transformation) are not 
correlated perfectly. Further support for such a position is 
reflected in the fact that DSF and DSB are only moderately 
correlated at r = .55, based on the WAIS-IV normative sample 
(Wechsler, 2008). Even after disattenuation for imperfect 
reliability (DSF α = .81; DSB α = .82), the disattenuated
correlation (r = .67) is far from unity. Thus, arguably, the key 
distinction between DSB and DSF is not simply that DSB is more 
difficult; instead, there appears to be a qualitative distinction, 
as well (Hurlstone, Hitch, & Baddeley, 2013).
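
The disattenuated correlation reported above is consistent with the standard correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes that this is the formula that was applied; the function name `disattenuate` is hypothetical.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Observed DSF-DSB correlation and reliabilities as reported in the text.
r_corrected = disattenuate(0.55, 0.81, 0.82)
print(f"disattenuated r = {r_corrected:.2f}")
```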

It was hypothesized that memory span would be affected by 
the Flynn effect, as memory span (WMC in particular) is very 
closely related to fluid intelligence (Gignac, 2014; Kane et al., 
2005). The results of this investigation failed to support the 
hypothesis that memory span would be affected by the Flynn 
effect. Overall, there were no meaningful changes in memory 
span from 1923 to 2008, as measured by DSF, DSB, and DST test 
scores. In contrast to memory span, substantial increases have 
been reported for fluid intelligence, particularly as measured by 
Raven's Progressive Matrices (Flynn, 2012). The lack of Flynn 
effect associated with STMC and WMC may be considered 
surprising, considering memory span is so intimately related 
with fluid intelligence (Chuderski, 2013; Colom, Abad, Quiroga,
Shih, & Flores-Mendoza, 2008; Colom, Abad, Rebollo, & Shih, 
2005; Kane et al., 2005). Thus, if the Flynn effect is not 
operating predominantly on g (te Nijenhuis & van der Flier, 
2013), and it is not operating on STMC or WMC, the contention 
that fluid intelligence test scores are increasing substantially 
across time is arguably difficult to reconcile. Based on the 
WAIS-IV normative sample, Gignac and Watkins (2013) 
estimated that the amount of unique internal consistency 
reliability associated with the Perceptual Reasoning index 
scores (similar to fluid intelligence) was approximately .18. 
Thus, once g, WMC, and STMC variance is removed from fluid 
intelligence-like test scores, there is only a small amount of
reliable variance upon which to effect substantial, systematic
changes of any sort.

Noteworthy, however, is the item-level research which 
suggests that the Flynn effect associated with Raven's scores 
may be due principally to cohort differences in the capacity for 
abstraction (Fox & Mitchum, 2013). Based on the results of this 
investigation, it would appear that any possible increases in 
abstraction capacity across time have occurred completely
devoid of any increases in WMC. Of course, statistically, it is
possible that two subtests may be observed to be associated
with a substantial inter-correlation across two cohorts, but only
one subtest evidences increases across time (Flynn, 2007).
However, given that the large association between WMC and 
fluid intelligence is theorised to be, at least partly, causal in 
nature (e.g., Halford, Cowan, & Andrews, 2007), the observation
of a Flynn effect for only fluid intelligence may be suggested 
to be improbable. Nonetheless, it should be acknowledged that 
the substantial, but currently non-experimentally established, 
association between WMC and fluid intelligence does not 
necessitate a FE across both constructs (Flynn, 2007). 

In fact, it is possible that the results of this investigation may 
be considered in line with the contention that the Flynn effect is 
operating primarily at the level of abstraction ability (Flynn, 
2012; Fox & Mitchum, 2013), rather than on a test such as Digit 
Span, as Digit Span is based on stimuli to which individuals 
85 years ago and today would have about an equal amount of 
exposure, i.e., digits from one to nine. Such a contention may 
be considered ostensibly plausible; however, when examined
thoroughly, one would draw the conclusion that humans have 
been exposed to digits at a substantially increasing rate across 
time. First, consider that, by 1930, only 12% of residents of 
New York had access to a telephone at home. By 1960, the 
percentage had increased to 76% (U.S. Bureau of Labor 
Statistics, 2006). In 2011, 89% of US households had a cellular 
phone and 71% a landline (U.S. Census Bureau, 2013a). 
Furthermore, in 1997, 18% of US residents had access to the 
internet at home and, by 2007, the number increased to 62% 
(U.S. Census Bureau, 2013). Thus, given phone numbers, login
numbers, personal identification numbers, digital clocks, digital
odometers, cable networks with hundreds of channels, online stock
broking accounts, etc., the typical person today is very likely
using digits at a rate astonishingly greater than that of the typical
person in the 1920s, the oldest data point used in this 
investigation. 

Finally, it will be noted that there were essentially no 
changes in adult abstraction ability based on the Similarities 
subtest (a measure of verbal abstraction; Weiss, Saklofske, 
Coalson, & Raiford, 2010) from the W-B (Wechsler, 1939) to
the WAIS (Wechsler, 1955), perhaps the only two editions that
allow for valid Similarities subtest comparisons in adults.^ In 
addition to this investigation, there are others that have either 
failed to observe a Flynn effect or have observed a reversal of
the Flynn effect (e.g., Shayer & Ginsburg, 2009; Sundet, Barlaug, 
& Torjussen, 2004; Teasdale & Owen, 2008). Ultimately, how 
the results associated with this investigation should be 
integrated within the FE literature may be debatable, a debate 
which will not be resolved definitively here. Further analysis 
and synthesis is of course encouraged. 

From a methodological perspective, it will be noted that 
many of the published studies supportive of the Flynn effect used 
indirect and possibly unsubstantiated quantitative methods. For 
example, Parker (1986) made use of the slope associated with 
time (in years) and IQ for the Stanford-Binet (Terman & Merrill,
1973) and applied it to the estimation of difference scores from 
individuals who completed different Wechsler scales (i.e., WAIS 
and WAIS-R). In another case, Beaujean and Sheng (2014) did 
not have access to the raw data; consequently, they estimated
the standard deviations associated with the Wechsler subscales 
by identifying the raw score equivalents associated with scaled 
scores of 7 and 13. Finally, in the context of evaluating Raven's 
IQ score gains in the Dutch from 1952 to 1982, Flynn (1987) 
reported the percentage of men who answered 24 items or more 
correctly and applied a method with several assumptions to 
estimate the changes in terms of IQ scores. Arguably, these 
methods are not ideal and/or particularly straightforward. By
contrast, a strength of this investigation is that the means and 
standard deviations were obtained directly, and the methods 
used to analyse the data and report the results were simple and 
straightforward. 

There are, naturally, limitations associated with this 
investigation. In particular, the standard normal deviate terms 
estimated for STMC (±2.39) and WMC (±2.58) assume that the 
DSF and DSB scores were normally distributed. It is highly 
unlikely that they were perfectly so, as most cognitive ability test 
scores are skewed to some degree (Micceri, 1989). Consequently,
the estimates reported in this investigation are accurate to the
extent that the distributions were not very substantially skewed.
Access to raw data would allow for even more precise estimates 
than those reported in this investigation. 

There were also slight ceiling effects in the data. Specifically, 
approximately 5-10% of the participants recalled correctly the 
largest series of digits associated with the DSF (i.e., 9) and the 
DSB (i.e., 8) subtests. Thus, the mean STMC and WMC values 
reported in this investigation are likely underestimates to a 
small degree. For the same reason, the variability in STMC 
and WMC scores reported in this investigation may also be 
expected to be underestimates to a small degree. Given the
relatively small amount of time it takes to administer Digit
Span, it would arguably be beneficial for the Wechsler scales to
include a 10 digit series and a nine digit series within the DSF
and DSB subtests, respectively.

^ The Similarities subtest within the W-B (Wechsler, 1939) and the WAIS
(Wechsler, 1955) consisted of 12 and 13 items, respectively. Two of the items
within the WAIS were completely revised and one additional item was
included. According to Wechsler (1945, p. 188), a raw Similarities score of 12
corresponded to a scaled score of 10 in the W-B (ages 17 to 70). Based on the
age-grouped (ages 16 to 69) raw score and scaled score equivalents published
in Wechsler (1955), I calculated that a scaled score of 10 corresponded to an
N-weighted mean raw score of 12.99. Thus, the mean Similarities raw score
appears to have increased by one point from the W-B to the WAIS. However,
given the extra item added to the WAIS, it would be plausible to suggest that
there was no meaningful change in verbal abstraction ability in adults from
1939 to 1955.

Much of the validity of the results reported in this 
investigation rests upon the contention that Digit Span is at 
least a decent indicator of intellectual functioning. Some would 
question such a contention (e.g., Matarazzo, 1972). Although 
certainly not the best indicator of intellectual functioning, I
believe the empirical evidence reviewed in the introduction
above suggests that Digit Span, and Digit Span Backward in
particular, is a good indicator of g and a strong correlate of fluid 
intelligence (Gignac, 2014). The Digit Span subtest was chosen 
because memory span has been relatively neglected in the 
Flynn effect literature, as well as because it afforded the best 
opportunity to evaluate test score changes across time from a 
subtest that has changed little over the years. 

It should also be acknowledged that the WMC and fluid 
intelligence research has been conducted primarily at the 
latent variable level; however, this investigation was conducted
at the observed score level, which is compromised, to some 
degree, by measurement error. Ideally, the hypothesis of the 
Flynn effect would be examined within the context of latent 
variable modelling, as measurement error would be held 
constant across all comparisons (i.e., 0). However, the use 
of latent variable modelling in this context rests upon the 
assumption of factorial invariance. The published research to
date suggests that this is an implausible assumption (Must
et al., 2009; Wicherts et al., 2004). It remains a possibility that
an invariant latent variable could be created from several
memory span tasks, rather than a whole intelligence battery. As
the WAIS-IV includes three memory span tasks (DSF, DSB, and 
DSS), once the WAIS-V is published, it may be a possibility to 
test the hypothesis tested in this investigation at the latent 
variable level. However, it would be done so within a relatively 
short span of years. 

Finally, the samples included in this investigation were 
drawn exclusively from the USA, as it proved to be the country 
with the largest number of good quality samples available for the 
purposes of examining the questions raised in this investigation. 
It is possible that the Flynn effect may be observed for memory 
span scores in other nationalities. Researchers are encouraged 
to explore this possibility, providing sufficiently good quality 
sources of data can be identified. Similarly, an extension of this 
investigation on samples of data from children may prove 
enlightening. However, it would appear that there would be 
fewer good quality samples available for inclusion in such an 
investigation. For example, the 'longest digit span forward' and 
'longest digit span backward' means and standard deviations 
associated with the WISC-R (Wechsler, 1974) were not 
published, to my knowledge. Based on the WISC-III (Wechsler,
1991) and WISC-IV (Wechsler, 2003) normative samples, there 
were virtually no changes in mean LDSF and LDSB values. 

In conclusion, it is commonly stated that the accumulated 
empirical results suggest that intelligence test scores have 
increased by approximately three IQ points per decade (Neisser
et al., 1996; Nisbett et al., 2012). Such evidence is occasionally 
used in the academic press (e.g., Flynn, 2007; Stanovich, 2011) 
and in the popular press (e.g., Gladwell, 2007; Holloway, 1999;
Murdoch, 2007) to support the position that conventional
intelligence test scores are of questionable validity as indicators
of intelligence. However, given that verbal STMC and verbal
WMC test scores do not appear to have increased in the last 
85 years, crystallised intelligence test scores only minimally 
or inconsistently (Flynn, 2007; Lynn, 1990, 2009), and that 
changes in subtest items/scoring/administration across edi- 
tions may explain a large percentage of several subtest score 
mean changes across time (Kaufman, 2010), it may be prudent 
to acknowledge that the magnitude, pervasiveness, and true
nature of the Flynn effect remain a substantially open question.